Abstract

In this paper we aim for segmentation and classification of objects. We propose codemaps, a
joint formulation of the classification score and the local neighborhood it belongs to in the
image. We obtain the codemap by reordering the encoding, pooling and classification steps over
lattice elements. Unlike existing linear decompositions, which emphasize only the efficiency
benefits for localized search, we make three novel contributions. As a preliminary, we provide
a theoretical generalization of the sufficient mathematical conditions under which image
encodings and classification become locally decomposable. As a first novelty, we introduce ℓ2
normalization for arbitrarily shaped image regions, which is fast enough for semantic
segmentation using our Fisher codemaps. Second, using the same lattice across images, we
propose kernel pooling, which embeds nonlinearities into codemaps for object classification by
explicit or approximate feature mappings. Results demonstrate that ℓ2 normalized Fisher
codemaps improve the state-of-the-art in semantic segmentation for PASCAL VOC. For object
classification, the addition of nonlinearities brings us on par with the state-of-the-art, but
is 3x faster. Because of the codemaps' inherent efficiency, we can reach significant speed-ups
for localized search as well. We exploit the efficiency gain for our third novelty: object
segment retrieval using a single query image only.


1.  Introduction 

It remains remarkable that the great successes in object recognition use so little of the
spatial order in the image. Features [?] are encoded in feature space [?], pooled in
histograms [?] and plugged into a kernel classifier [?]. The entire chain contains no more
spatial information than the locality of the features, compensated by a rather crude method of
spatial pyramids [?], where the standard classification procedure is repeated over upper and
lower parts of the image. To make progress in recognition, better inclusion of more spatial
information is inevitable. And indeed, recently [?] have introduced the spatial coherence of
objects, by confining the usual analysis to selected windows in the image. The spatial
restriction has had a positive effect on the recognition result. However, the selection of
windows is only loosely coupled to the classification pipeline. In this paper, we aim to
integrate locality much further into the analysis.

Figure 1. Codemaps segment, classify and search objects locally by reordering the encoding,
pooling and classification steps of object classification. Different from existing linear
decompositions for specific pipelines, codemaps are generic, embed fast ℓ2 normalization,
include nonlinearities by local kernel pooling and allow for segment retrieval using a single
query image only.

Starting at the other end, the route of first segmentation, then recognition, is as old as
Blobworld [?], describing parts as visually coherent regions. In [?] the regions were jointly
modeled to establish semantic similarity between adjacent object parts. And indeed, consistent
regions lead to useful object hypotheses [?, 5], paving the way to object segmentation. Pushing
localization with state-of-the-art encodings to an extreme, [?] classifies pixels with a Fisher
kernel for semantic segmentation, as application of Fisher on regions would be practically
infeasible. We will argue in this paper the virtue of a deeper connection between spatial
localization and object type classification using state-of-the-art pipelines, such as the
improved Fisher kernel [?] or explicit feature maps [?].

We aim to combine semantic segmentation with recognition at the earliest stage of the analysis.
We show that reordering the processing steps for object type classification into local pooling
before classification has considerable advantages. Where [?] have shown the efficiency benefits
of such decompositions for unnormalized bag-of-words with a linear classifier, our codemaps
make three novel contributions. As a preliminary result, we formulate the sufficient
mathematical conditions under which image encoding and classification are locally decomposable
(Section 3). In the first novelty, we use this result to introduce codemaps with ℓ2
normalization for arbitrarily shaped image regions (Section 4), essential to reach a better
than state-of-the-art performance in semantic segmentation [?, 5]. In the second novelty, using
the same lattice across images, we include nonlinearity in the decomposition by local kernel
pooling (Section 5), to bring us on par with the state-of-the-art in object classification [?],
but 3x faster. Thirdly, we demonstrate the effectiveness of codemaps in object segment
retrieval from a single query image (Section 6).

2.  Related  work 

We structure our discussion of related work by the subsequent steps of (localized) object type
segmentation and classification: semantic segmentation, feature encoding, feature pooling and
kernel classification.

Semantic segmentation. For semantic segmentation two main approaches have been adopted, both of
which start from an initial image split into superpixels. The first approach [?, 33] then tries
to group the superpixels on the basis of semantic similarity in a conditional random field.
Such approaches achieve excellent accuracy, but suffer from a high computational cost due to
complicated training and inference. The second approach [?, 5] first tries to form complete
segment hypotheses based on low-level cues. Then, these segment hypotheses are classified with
a standard object classification procedure. In the current work we follow the second family of
approaches. We take advantage of the complete segment hypotheses being composed of superpixels
to enrich the segment representation with state-of-the-art image classification using feature
encodings.

Feature encoding. Feature encodings capture the visual information around a local neighborhood
and generate a measurement, which is supposed to be invariant to accidental circumstances, such
as illumination, shade, occlusion, etc. A feature encoding uses the raw pixel data from a local
neighborhood to generate a code [?]. The acquired code is projected to a, usually, higher
dimensional space. The most popular projections are vector quantization [26], soft quantization
[?, 35], or the difference of projections to pre-trained models captured by Fisher [?] and VLAD
[14] vectors. The framework we propose is not constrained to any particular encoding choice, as
long as the encoding is local. In our experiments we use Fisher vectors, as they have been
shown to yield state-of-the-art results in object classification [23] and retrieval [?].

Feature pooling. Feature pooling spatially aggregates the relevant local feature encodings into
a global image representation. Average pooling has been shown to work best for bag-of-words [?]
and Fisher vectors [?]. Max pooling has proven effective for sparse coding [?] and deep
learning [?]. Sum pooling has been the preferred choice for VLAD vectors [14]. In this paper we
generalize over the pooling functions. We show that pooling over a region of interest is
equivalent to a simpler two-level pooling for a particular family of mathematical functions.
This two-level pooling allows us to classify objects locally, while offering a substantial
speed-up.

Kernel classification. For object type classification, support vector machines have repeatedly
been shown to outperform [?] all other alternatives. To cope with the growing number of images,
the size of the image representations and the number of object types, recent research has
focused on efficient learning and classification. Kernel properties of support vector machines,
like additivity and homogeneity, have been exploited for speed-ups, especially for nonlinear
kernels [?]. Classifiers currently employ the local origin of the data only weakly [?]. In this
work we modularize linear and nonlinear kernels to arrive at an object type decision at the
level of a local neighborhood.

We are inspired by [?], who observe that the image interpretation of unnormalized bag-of-words
with a linear classifier can be analyzed in terms of the contributions of individual
descriptors, leading to a considerable efficiency gain. That work does not generalize the
decomposition, as we do, to reorder the steps of general object classification pipelines,
including for example Fisher vectors and VLAD. By doing so, we obtain a joint formulation of
the classification score and the local neighborhood it belongs to. Furthermore, our generalized
framework can obtain the precise ℓ2 normalized classification score for any region, which is
known to increase the accuracy of both Fisher vectors and VLAD considerably [?]. Last but not
least, we propose kernel pooling, which embeds nonlinearities by explicit or approximate
feature mappings [?] to assure state-of-the-art competitiveness [?]. We call our approach
codemaps and we will now highlight its theoretical foundation in a preliminary.
3.  Preliminary

We start from a lattice, composed of $N$ nodes, $G = \{g_i\}, i = 1, \ldots, N$, superimposed on
an image. To ensure good generalization and flexibility, we consider that a) each node $g_i$ of
the lattice is arbitrarily sized and shaped, and the nodes are non-overlapping, i.e.
$g_i \cap g_j = \emptyset, \forall g_i, g_j \in G, i \neq j$, and b) each area $R$ where we
search for the objects of interest is composed of multiple nodes,
$R = g_1 \cup \ldots \cup g_l$. Thus, regions are also arbitrarily sized and shaped. Hence, the
image search is no longer confined to specific and limiting templates, such as rectangular
areas [?]. For ease of reading we shall refer to each node $g_i$ of the lattice as a lex. Our
theory holds for all types of patches, including cells on a regular lattice [18], generalized
image regions [?], superpixels [?] or any other type of locality.

We extract a collection of local features $z_i, i = 1, \ldots, M$ in the image and encode them
to codes $c_i, i = 1, \ldots, M$. The pooling function $h(R)$ combines these local codes within
the region $R$ to arrive at its global feature encoding.

Codemaps. We define a codemap as a decomposed object image representation $\Phi = (G, \phi)$,
where $\phi_i, i = 1, \ldots, N$ denotes the object evidence per lex $g_i$.

We begin with unnormalized codemaps. Hence, $\phi_i = f(h(g_i))$ stands for the per-lex
classification score using classifier function $f$. For an image region $R$ the corresponding
classification score can be written as:

$$f(h(R)) = f(h(g_1 \cup g_2 \cup \ldots \cup g_l)). \tag{1}$$

Formally, the property of a codemap can be described by

$$f(h(R)) = q\big(f(h(g_1)), f(h(g_2)), \ldots, f(h(g_l))\big), \tag{2}$$

where $q$ is a classification pooling function that aggregates the localized classifier
decisions over a region of interest.

From eq. (2) we see that the pooling function $h$ needs to be applied to each of the lexes
$g_i$ separately. Taking into account eq. (1), we arrive at the first condition for obtaining a
valid codemap:

Condition 1 The pooling function $h : A \to B$ must be homomorphic from the space $A$ to the
space $B$, so that

$$h(R) = h\big(\cup_A(g_1, g_2, \ldots, g_l)\big) = \cup_B\big(h(g_1), h(g_2), \ldots, h(g_l)\big), \tag{3}$$

where $A$ refers to the spatial domain formed by the lexes $\{g_i\}$, and $B$ refers to the code
pooling space defined by $h$. In eq. (3), $\cup_A$ stands for the set union operation, that is
$\cup_A = \cup(g_1, g_2, \ldots, g_l)$, whereas $\cup_B$ is an operation in $B$ that makes $h$
homomorphic. When $h$ stands for sum pooling or max pooling, $\cup_B$ is the sum operator or the
max operator, respectively.


In practice, a homomorphic pooling means that we can first locally pool the encodings from each
lex $g_i$ separately, then combine them to get the global feature encoding as if we had
operated on $R$ in the first place.
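
As an illustration of Condition 1, the following NumPy sketch checks numerically that sum pooling and max pooling over a region give the same result as pooling each lex first and then combining the per-lex results. The array sizes, the random codes and the helper names `pool_region` and `pool_two_level` are assumptions made for this example only; they are not taken from the method itself.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: M local codes of dimension D, each assigned to one of N lexes.
M, D, N = 200, 8, 10
codes = rng.normal(size=(M, D))
lex_of_code = rng.integers(0, N, size=M)
lex_of_code[:N] = np.arange(N)  # make sure every lex holds at least one code


def pool_region(region, op):
    """Pool all codes that fall inside the region in one pass (global pooling)."""
    mask = np.isin(lex_of_code, region)
    return op(codes[mask], axis=0)


def pool_two_level(region, op):
    """Pool each lex separately, then combine the per-lex results with the same operator."""
    per_lex = np.stack([op(codes[lex_of_code == g], axis=0) for g in region])
    return op(per_lex, axis=0)


region = [1, 4, 7]  # a region R made of three lexes
sum_ok = np.allclose(pool_region(region, np.sum), pool_two_level(region, np.sum))
max_ok = np.allclose(pool_region(region, np.max), pool_two_level(region, np.max))
print(sum_ok, max_ok)
```

Both checks compare a single pass over the region with the two-level route, which is the property that eq. (3) asks of the pooling function.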

In addition, we want the classifier $f$ to also operate on each of the lexes $g_i$
individually. By combining eq. (1) and eq. (3), we arrive at the second condition:

Condition 2 The classification function $f : B \to C$ must be homomorphic from the space $B$ to
the space $C$, so that

$$f(h(R)) = f\big(\cup_B(h(g_1), h(g_2), \ldots, h(g_l))\big) = \cup_C\big(f(h(g_1)), f(h(g_2)), \ldots, f(h(g_l))\big), \tag{4}$$

where $C$ refers to the classification space. Having a homomorphic function for the classifier
$f$, one only needs to consider the individual scores of the lexes within $R$.

Normally, when classifying a region we first perform a global pooling on all the feature
encodings contained in the region, and then we apply the classifier. However, according to
Cond. 1, codemaps first break the global pooling into a collection of local feature poolings
over lexes. Then, according to Cond. 2, codemaps apply the classifier on the local feature
poolings and perform a global pooling on the classification scores of the lexes. Hence, the
global pooling is performed on single scalars instead of high dimensional vectors. This brings
significant efficiency benefits for vision tasks where thousands of regions need to be
classified per image, such as in semantic segmentation.

We conclude that if Cond. 1 and 2 are met, we obtain a valid codemap.
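
To make the reordering concrete, here is a minimal sketch assuming sum pooling and a linear classifier without bias, the simplest pair that satisfies both conditions. The sizes, the random data and the function names are hypothetical. The per-lex scores are computed once; each region is then scored by summing scalars rather than pooling high dimensional vectors.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical sizes: N lexes, D-dimensional pooled encodings.
N, D = 50, 64
H = rng.normal(size=(N, D))  # row i is h(g_i), the sum-pooled encoding of lex i
w = rng.normal(size=D)       # weights of a linear classifier f(x) = w^T x

# Codemap: one scalar score per lex, computed once per image.
lex_scores = H @ w


def score_region_direct(region):
    """Pool the D-dimensional encodings over the region, then classify."""
    return float(w @ H[region].sum(axis=0))


def score_region_codemap(region):
    """Pool the per-lex scalar scores instead of the encodings."""
    return float(lex_scores[region].sum())


region = np.array([3, 8, 21, 40])
direct = score_region_direct(region)
via_codemap = score_region_codemap(region)
print(np.isclose(direct, via_codemap))
```

In this sketch the direct route costs on the order of D operations per lex in the region, whereas the codemap route costs one addition per lex once `lex_scores` is available.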

4.  ℓ2  normalization  for  arbitrary  regions

Modern feature encodings, such as Fisher vectors, VLAD or bag-of-words, usually include a
summation operator in the feature pooling function $h$. When a linear classifier $f$ is used,
the classification score of a region is
$y(R) = w^T h(R) = \sum_d \sum_{g_i \in R} w_d h_d(g_i)$, where $w$ denotes the $d$-dimensional
weights learned by the linear classifier. This leads to a valid codemap, since

$$y(R) = \sum_{i=1}^{l} y(g_i), \tag{5}$$

where $y(g_i) = \sum_d w_d h_d(g_i)$. A similar decomposition was derived in [?] for the
specific case of unnormalized bag-of-words with a linear SVM. However, feature encodings also
profit highly from normalization before classification [?]. Including normalization, in
particular for variable spatial regions, is difficult to do efficiently. We propose ℓ2
normalization for arbitrary regions in codemaps.

In general, for a region $R$ the norm of its feature encoding vector $h(R)$ is $\|L_R\|$.
Because of the linear classifier we can rewrite the normalized classification score as:

$$y(R) = f\left(\frac{1}{\|L_R\|} h(R)\right) = \frac{1}{\|L_R\|} \sum_{i=1}^{l} f(h(g_i)). \tag{6}$$

As eq. (6) indicates, to obtain the normalized classification score we can postpone the scaling
by the inverse norm until after the classification pooling. Thus, with codemaps, normalization
boils down to multiplying the classification score of a region by a scalar, instead of scaling
the high dimensional feature encodings. We then consider the ℓ2 norm of the feature encoding
for a region $R$ within an image, since the linear classifier prefers ℓ2 normalization. It is
equal to the square root of the dot product of $h(R)$ with itself:

$$\|L_R\|_2 = \left(h(R)^T h(R)\right)^{1/2} = \left(\sum_{i=1}^{l} \sum_{j=1}^{l} h(g_i)^T h(g_j)\right)^{1/2}. \tag{7}$$

As we can see from eq. (7), to calculate the ℓ2 norm of a region $R$ we only need to know the
sum of the pair-wise dot products $h(g_i)^T h(g_j)$ between feature encodings of the lexes
within the region. To generalize to any arbitrary region $R$, we calculate the dot products of
all the pair-wise lex combinations in the image. Then, we only need to consider the
combinations of lexes that both appear in $R$, that is:

$$\|L_R\|_2 = \left(\sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j \, h(g_i)^T h(g_j)\right)^{1/2}, \tag{8}$$

where the binary vector $\alpha = (\alpha_1, \ldots, \alpha_N)$ indicates whether each lex is
present or not within the region $R$. Finally, we compute the ℓ2-normalized classification
score of an arbitrary region $R$ as:

$$y(R) = \frac{1}{\|L_R\|_2} \sum_{i=1}^{N} \alpha_i \, w^T h(g_i). \tag{9}$$

We describe the ℓ2 normalized codemap of an image as:

$$\Phi = \{g_i, \; w^T h(g_i), \; h(g_i)^T h(g_j)\}, \tag{10}$$

for $i, j = 1, \ldots, N$.

Fisher codemaps. The popular Fisher vector, extracted from a Gaussian mixture model with a
probability density function $u(\cdot \mid \mu, \sigma)$, is equal to
$c_{z_i} = \frac{1}{M_R} \nabla_{\mu,\sigma} \log u(z_i \mid \mu, \sigma)$, where $M_R$ stands
for the number of local descriptors $z_i$ sampled from an image. A codemap is independent of
the regions $R$, hence the value of $M_R$ is not available. However, $M_R$ is canceled out
during the ℓ2 normalization, therefore we propose to drop the constant $M_R$ from the original
Fisher vector by $c_{z_i} = \nabla_{\mu,\sigma} \log u(z_i \mid \mu, \sigma)$. Since we use the
sum operator for the feature pooling and the sum operator due to the linear classifier, Cond. 1
and 2 are fulfilled and we obtain a valid Fisher codemap. We observe in Figure 2(a) that Fisher
codemaps allow for a considerable speed-up when classifying a number of arbitrarily sized and
shaped regions. For evaluating 1,000 regions the Fisher codemap needs 23 seconds per image, as
compared to 22 minutes when using the traditional Fisher vectors. The speed-up improves further
when more evaluations are required. Moreover, for a large number of classes, codemaps still
have a clear advantage. The computation of the normalization matrix
$h(g_i)^T h(g_j), i, j = 1, \ldots, N$ for codemaps is shared across all the object classes.
Therefore, for classifying 1,000 object categories over 1,000 image regions, the Fisher
codemaps are still 45x faster than the Fisher vectors. We conclude that our ℓ2 normalized
Fisher codemaps are mathematically equivalent to Fisher vectors, but much faster. Similar
formulations and efficiency benefits can be derived for other feature encodings, e.g.
bag-of-words or VLAD, as well.

5.  Nonlinear  kernel  pooling  for  classification

In principle, codemaps work with arbitrary lattices for images. We now approach codemaps from a
kernel point of view, aiming to introduce nonlinearities for object classification using the
same lattice across images. Given two codemaps $\Phi_X$ and $\Phi_Z$ for images $X$ and $Z$, the
similarity between the two images using a linear kernel is:

$$K_L(X, Z) = h(X)^T h(Z) = \sum_{g_x \in X} \sum_{g_z \in Z} h(g_x)^T h(g_z), \tag{11}$$

which is equivalent to the sum of the linear-kernel similarities between pair-wise lexes.
Hence, we can apply the kernel trick and use more sophisticated kernels to measure the
pair-wise lex similarity. In that case the similarity between images $X$ and $Z$ becomes:

$$K(X, Z) = \sum_{g_x \in X} \sum_{g_z \in Z} k\big(h(g_x), h(g_z)\big). \tag{12}$$

$K(X, Z)$ is a positive definite kernel as long as $k$ is positive definite as well [?]. This
is related to match kernels [?] between sets of local features, e.g. SIFT, but we consider
kernels between lexes. All the standard kernels from the literature are applicable for $k$ in
eq. (12). In order to preserve Cond. 1 and 2 of codemaps, we opt for explicit or approximate
kernel mappings [21, 25, 32], that is:

$$K(X, Z) = \sum_{g_x \in X} \sum_{g_z \in Z} \psi(h(g_x))^T \psi(h(g_z)) = h(X)^T h(Z), \tag{13}$$
Figure 2. Timing and memory usage for codemaps. All the experiments are done on the PASCAL VOC
images using a Gaussian mixture model of 256 components. (a) ℓ2 normalized Fisher codemaps
(with 400 lexes per image) are up to 56x faster than traditional Fisher vectors, depending on
the number of regions analyzed (note the log10/log10 scales). (b) ℓ2 normalization for
arbitrary regions is efficient. For the 400-500 lexes per image that usually suffice for
semantic segmentation [2, 5, 33], the unnormalized and the normalized Fisher codemaps are
practically as efficient, but the normalized Fisher codemaps are much more effective, as shown
in Table 1. (c) Depending on the number of lexes, computing Fisher codemaps costs up to 600 MB
memory per image, while storing them needs less than 30 MB.


where $h(X) = \sum_{g_x \in X} \psi(h(g_x))$ indicates the nonlinear feature pooling function
for image $X$, which is the sum over the set of pooled lexes after applying the nonlinear
kernel mapping $\psi$. We refer to $\psi(h(g_i))$ as local nonlinear kernel pooling. Thus we
still use the sum operator for global feature pooling and a linear classifier, which lead to a
valid codemap. For normalization we need $K(X, X) = 1$, which is equivalent to using ℓ2
normalization on $h(X)$. Then the resulting codemap with nonlinear kernel pooling is defined
as:

$$\Phi = \{g_i, \; w^T \psi(h(g_i)), \; \psi(h(g_i))^T \psi(h(g_j))\}, \tag{14}$$

for $i, j = 1, \ldots, N$.

Applying nonlinear kernel pooling for each lex makes the global image feature encoding $h(X)$
dependent on the partition of the lattice elements placed on the image. Therefore, to ensure a
consistent image representation so that we can measure image similarity properly, we define the
same lattice across all the images. Consequently, codemaps with kernel pooling have a strong
connection with spatial pyramid kernels. For the spatial pyramid kernel we compute the
similarity of each lex in an image only with itself, whereas for codemaps all the pair-wise
similarities between lexes are considered. Hence, one could view our codemaps with kernel
pooling as an extension of the spatial pyramid kernels. However, for our kernel pooling the
final classification score is computed from a single lattice based on all the partitions of
spatial pyramids without any redundancy, where spatial pyramids require multiple layouts. As a
result, codemaps with kernel pooling allow for the inclusion of richer spatial information in
the final classification score at a nearly zero cost.
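
The equivalence between the pair-wise kernel sum of eq. (12) and the pooled dot product of eq. (13) can be checked with a small sketch. It assumes the Hellinger kernel on non-negative encodings, for which the element-wise square root is an exact explicit feature map; signed encodings are not covered here. The lattice size, the data and the function names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical setup: two images sharing the same lattice of N lexes,
# each lex described by a non-negative D-dimensional pooled encoding.
N, D = 12, 16
HX = rng.random(size=(N, D))
HZ = rng.random(size=(N, D))


def hellinger(x, y):
    """Hellinger kernel between two non-negative vectors."""
    return float(np.sum(np.sqrt(x * y)))


def psi(h):
    """Explicit feature map of the Hellinger kernel: element-wise square root."""
    return np.sqrt(h)


# Eq. (12): sum of kernel values over all pairs of lexes.
K_pairs = sum(hellinger(hx, hz) for hx in HX for hz in HZ)

# Eq. (13): apply psi per lex, sum-pool, then take a single dot product.
pooled_X = psi(HX).sum(axis=0)
pooled_Z = psi(HZ).sum(axis=0)
K_pooled = float(pooled_X @ pooled_Z)

print(np.isclose(K_pairs, K_pooled))
```

The pair-wise route evaluates N x N kernels per image pair, while the pooled route applies the map once per lex and then needs a single dot product, which is what keeps the representation compatible with a linear classifier.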

6.  Experiments 

We demonstrate the efficiency and effectiveness of the proposed codemaps by experiments on
semantic segmentation, object classification and segmented object retrieval. Since these tasks
all require repetitive computations on overlapping regions, performing them once with ℓ2
normalized codemaps and nonlinear kernel pooling leads to a considerable speed-up. We summarize
the timing and memory usage in Figure 2.

                        Bag-of-words    Fisher codemaps
Region normalization         -            -        ℓ2
mAP                         4.1          7.0      26.9

Table 1. ℓ2 normalized semantic segmentation for arbitrary regions is effective. Mean average
precision on the val set of the segmentation task in the PASCAL VOC 2011, where - indicates
unnormalized versions over regions.

6.1.  ℓ2  normalized  semantic  segmentation

In the first experiment we quantify the value of codemaps with ℓ2 normalization for semantic
segmentation, where several image regions need to be evaluated on the presence of objects and
their type. We use the PASCAL VOC Segmentation dataset and follow the training protocol of
CPMC-O2P [5], which combines three segmentation-tailored and color-enhanced features, and
trains linear support vector regressors based on the overlaps between segments. We use the
Fisher codemaps from Section 4, with dense sampling of basic intensity SIFT descriptors per
pixel at multiple scales and a Gaussian mixture model of 128 components. Note that we do not
consider any feature-specific optimizations for the purpose of semantic segmentation. We also
consider the unnormalized Fisher codemap version and unnormalized bag-of-words features using a
visual codebook of size 4,000, similar to the ones used in [?]. While unnormalized Fisher
vectors have been used for pixel-level segmentation [?], we are unaware of segment-level
classification with normalized Fisher vectors.

Category    CPMC-O2P [5]   FGT_SEGM [10]   DivMBest+Rerank [22]   CPMC-O2P [5] + Fisher codemaps
mAP             46.4           47.5              48.1                    48.3*
Bgnd            84.7           85.2              85.7*                   85.3
Plane           63.5           63.4              62.7                    66.2*
Bike            23.4           27.3*             25.6                    24.4
Bird            45.0           56.1*             46.9                    47.5
Boat            40.8           37.7              43.0*                   37.2
Bottle          44.9           47.2              54.8*                   52.4
Bus             59.1           57.9              58.4                    60.4*
Car             58.3           59.3              58.6                    61.1*
Cat             57.1*          55.0              55.6                    56.5
Chair           11.8           11.5              14.6*                   12.8
Cow             42.9           50.8*             47.5                    44.5
Table           32.8           30.5              31.2                    32.9*
Dog             45.2*          45.0              44.7                    44.8
Horse           55.4           58.4              51.0                    60.8*
M/bike          56.6           57.4              60.9                    61.3*
Person          51.2           48.6              53.5                    55.8*
P/Plant         35.6           34.6              36.6*                   33.2
Sheep           44.9           53.3*             50.9                    49.8
Sofa            30.3           32.4              30.1                    34.3*
Train           48.0           47.6              50.2*                   47.9
TV              42.5           39.2              46.8*                   45.0

Table 2. State-of-the-art semantic segmentation. Following the exact protocol of [5] we show
semantic segmentation results for the PASCAL VOC 2012 comp6 task. Adding normalized Fisher
codemaps on top of CPMC-O2P improves the state-of-the-art in semantic segmentation for 8 out of
21 object categories. Best result per row marked with *.

Figure 3. Semantic segmentation. Adding normalized Fisher codemaps (bottom row) on top of
CPMC-O2P [5] (top row) appears to be beneficial when multiple objects appear simultaneously in
the image. Note for example the difficult case in the last column, where codemaps help to
better segment the motorbike on the poster and the motorbike in the right part of the image.

Fisher codemaps. We first consider the benefit of ℓ2 normalization. We use the VOC 2011 train
set for training and we report results on the val set in Table 1. We observe that ℓ2 normalized
Fisher codemaps outperform the unnormalized ones by far. Fisher codemaps obtain 26.9 mAP (mean
Average Precision), where the unnormalized Fisher codemaps obtain only 7.0 mAP. While
unnormalized Fisher codemaps outperform bag-of-words, the ℓ2 normalization is critical for
linear regression, since we have to ensure that the overlap between each segment and itself is
largest and equal to 1. We plot in Figure 2(b) how efficient it is to compute a Fisher codemap,
under a varying number of lexes in the lattice. Calculating the normalized Fisher codemap is as
efficient as the unnormalized version for up to 500 lexes. For semantic segmentation in
particular, since 400-500 lexes per image usually suffice [2, 5, 33], calculating the ℓ2
normalized Fisher codemaps is practically as efficient as the unnormalized one, but much more
accurate.

CPMC-O2P + Fisher codemaps. Since the leading semantic segmentation methods use multiple
features to capture several aspects of the object information, i.e. in [5] 3 features and in
[?] 58 features are used, we embed Fisher codemaps into the multi-feature approach of CPMC-O2P
[5] to improve the state-of-the-art in semantic segmentation. We note that the individual
features of CPMC-O2P, i.e. eSIFT, eMSIFT and eLBP, obtain 28.4, 31.0 and 21.2 mAP on the VOC
2011 val set respectively, where Fisher codemaps score 26.9 without any optimizations for
semantic segmentation. In this experiment we use the additional training set, the same as
CPMC-O2P [5], and report the results on comp6 of the VOC 2012 challenge. Since both the
features in [5] and ℓ2 normalized Fisher codemaps use a linear regressor, we rely on late
fusion with linear weights learned on the val set to combine them. We also compare against the
best reported methods so far [10, 22]. The numerical results are shown in Table 2. Adding
Fisher codemaps brings more precision to the image region representations. More specifically,
the semantic segmentation accuracy is improved for 8 out of the 21 object categories (including
“background”). In Figure 3 we illustrate some of the segmentation results. We observe that
Fisher codemaps are particularly helpful when multiple objects are present simultaneously. We
conclude that a combination of CPMC-O2P with our ℓ2 normalized Fisher codemaps improves the
state-of-the-art in semantic segmentation.
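
The late fusion step can be pictured with the following sketch. The text does not specify how the linear weights are learned, so ordinary least squares on held-out scores is an assumption made here, as are the synthetic data and all variable names.

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical validation data: per-segment scores from two regressors
# and the ground-truth overlap that both try to predict.
n_segments = 500
scores_a = rng.random(n_segments)          # e.g. the existing features
scores_b = rng.random(n_segments)          # e.g. the Fisher codemaps
overlap = 0.7 * scores_a + 0.3 * scores_b  # synthetic, noise-free target

# Assumed learning rule: least-squares fit of the fusion weights on the validation set.
S = np.column_stack([scores_a, scores_b])
weights, *_ = np.linalg.lstsq(S, overlap, rcond=None)


def fuse(score_a, score_b):
    """Late fusion: weighted sum of the two regressor outputs."""
    return weights[0] * score_a + weights[1] * score_b


print(np.round(weights, 3))
```

Because both inputs are scalar scores per segment, the fusion adds only a weighted sum per segment on top of the two regressors.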

6.2.  Nonlinear  kernel  pooling  for  classification 

In the second experiment we quantify the value of
codemaps with ℓ2 normalization and nonlinear kernel pooling
for object classification. We use the PASCAL VOC
2007 Classification dataset [11] for both bag-of-words and
Fisher vectors [13]. We sample dense SIFT descriptors every
two pixels at multiple scales. We use a visual codebook
of size 4,000 with hard assignment for the bag-of-words
and a 256-component Gaussian mixture model for the
Fisher vectors. We employ 1×1, 2×2 and 3×1 spatial pyramids.
Since power normalization has been shown to work particularly
well for Fisher vectors [13], we implement Fisher
codemaps with local Hellinger kernel pooling. For bag-of-words
the χ2 and histogram intersection kernels are the top
performeiirm061 .  We  therefore  implement  bag-of-words 
codemaps with local χ2 and histogram intersection kernel
pooling using explicit feature mapE~H21. While both behave
similarly compared to normal bag-of-words, we report
only the slightly better performing histogram intersection
kernel. The numerical results are shown in Table 3.
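
To make the notion of an explicit feature map concrete, the sketch below covers the Hellinger case only. For non-negative histograms the Hellinger kernel equals a plain dot product after an element-wise square root. For signed vectors such as Fisher encodings, the signed square root coincides with power normalization with exponent 0.5. The function names and toy histograms are assumptions, and the approximate map for the histogram intersection kernel is not shown.

```python
import math


def hellinger_map(vec):
    """Signed square root: explicit feature map of the Hellinger kernel."""
    return [math.copysign(math.sqrt(abs(v)), v) for v in vec]


def l2_normalize(vec):
    """Scale a vector to unit Euclidean length (zero vectors are returned as-is)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else list(vec)


def dot(u, v):
    """Plain dot product of two equally long vectors."""
    return sum(a * b for a, b in zip(u, v))


def hellinger_kernel(x, y):
    """Hellinger kernel on non-negative histograms: sum_i sqrt(x_i * y_i)."""
    return sum(math.sqrt(a * b) for a, b in zip(x, y))


# Two toy L1-normalized histograms (illustrative values only).
x = [0.25, 0.25, 0.5, 0.0]
y = [0.0, 0.25, 0.25, 0.5]

kernel_value = hellinger_kernel(x, y)
linear_value = dot(hellinger_map(x), hellinger_map(y))
```

Because the nonlinearity is moved into the feature map, a linear classifier on the mapped encodings behaves like a kernel classifier, which is what keeps the scores additive over lattice elements.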

                      Bag-of-words           Fisher
                      Others   Codemaps      Others   Codemaps
ℓ2 normalization      42.4     42.4          55.0     55.0
Nonlinearities        54.9     54.8          61.6     61.6

Table 3. Nonlinear kernel pooling for object classification.
Mean average precision on the PASCAL VOC 2007, following the
protocol in [13]. Both bag-of-words and Fisher codemaps with the
proposed ℓ2 normalization and nonlinear kernel pooling have the
same accuracy as the state-of-the-art. Where the best Fisher
vectors require 18 seconds per image for evaluating all 20 classes, our
codemaps require only 6 seconds.

For both bag-of-words and Fisher, a codemap with ℓ2
normalization is mathematically equivalent to the regular
ℓ2-normalized linear model. Hence the results for the linear
classifier are identical. When we add nonlinearities to the
bag-of-words by means of a histogram intersection kernel,
the results increase from 42.4 to 54.9 mAP. With histogram
intersection kernel pooling using approximate feature maps
we obtain practically the same result for codemaps: 54.8
mAP. Fisher vectors outperform the bag-of-words, reaching a
maximum of 61.6 mAP using ℓ2 normalization and a
power norm nonlinearity. With Hellinger kernel pooling we
reach the same result. However, our codemaps only need a
single-resolution lattice, as compared to the multiple lattices
required by spatial pyramid kernels. Hence, we can evaluate
an image for all 20 classes in 6 seconds, where Fisher
vectors require 18 seconds. Since it also costs around 6 seconds
for Fisher vectors to test an image without any spatial
pyramids, codemaps can include full spatial pyramids at
nearly zero additional cost, while increasing the mAP from 57.4
to 61.6. Moreover, in 18 seconds per image we can also embed
RGB-SIFT and OpponentSIFT frouTOOl in a colored
Fisher codemap, by simple average fusion, resulting in 64.1
mAP. Codemaps with our proposed ℓ2 normalization and
nonlinear kernel pooling are as good as the state-of-the-art,
but 3x more efficient to compute.
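
The equivalence between an ℓ2-normalized codemap and the regular ℓ2-normalized linear model can be checked numerically. The sketch below assumes sum pooling over lexes and a precomputed lex-by-lex Gram matrix for the norm. This is one plausible way to realize the decomposition, not necessarily the exact implementation used in the experiments, and all names and toy values are hypothetical.

```python
import math


def dot(u, v):
    """Plain dot product of two equally long vectors."""
    return sum(a * b for a, b in zip(u, v))


def region_score_direct(lex_encodings, region, w):
    """Pool the lex encodings of a region, l2-normalize, then apply the classifier."""
    dim = len(w)
    pooled = [sum(lex_encodings[l][d] for l in region) for d in range(dim)]
    norm = math.sqrt(dot(pooled, pooled))
    return dot(w, pooled) / norm


def region_score_codemap(lex_scores, gram, region):
    """Same score from per-lex classifier scores and pairwise lex dot products."""
    numerator = sum(lex_scores[l] for l in region)
    squared_norm = sum(gram[l][m] for l in region for m in region)
    return numerator / math.sqrt(squared_norm)


# Toy lex encodings and classifier weights (illustrative values only).
lexes = [[1.0, 0.0, 2.0], [0.5, 1.0, -1.0], [0.0, 3.0, 1.0]]
w = [0.2, -0.1, 0.4]

# Computed once per image: one score per lex and the lex-by-lex Gram matrix.
lex_scores = [dot(w, x) for x in lexes]
gram = [[dot(a, b) for b in lexes] for a in lexes]

region = [0, 2]  # an arbitrarily shaped region given as a set of lex indices
direct = region_score_direct(lexes, region, w)
decomposed = region_score_codemap(lex_scores, gram, region)
```

Once the per-lex scores and the Gram matrix are available, scoring another region only requires summing scalars, with no need to re-encode or re-pool high-dimensional vectors.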

6.3.  Codemaps  for  segmented  object  retrieval 

In the last experiment we take advantage of the efficiency
benefits of ℓ2-normalized codemaps to revisit the
old challenge of object segment retrievaDfil and we suggest
a new solution. We propose to apply codemaps for
segmented object retrieval in a query-by-example setting.
We use the query images from the instance search task of
the TRECVID 2012 [17] video retrieval benchmark. Note
that we do not aim at the regular setting of full-image
instance search, but explore whether image regions
can be retrieved using a single query image only. We extract
normalized Fisher codemaps in the same way as in the previous
experiment. As lexes we use the superpixel regions
frorQ2]. 

Figure 4. Segmented object retrieval. On the left of the gray line
we have the query image with the user-specified object of interest.
On the right are the top two retrieved results, with the top three
estimates for the regions of interest. The white striped lines denote
the ground truth. Although only a single query example is
used, we can identify the segmented object of interest. This is
especially noteworthy for the Brooklyn Bridge (top row), where the
segmented query is viewed sideways and the retrieved segments
are photographed from a frontal view and varying distances.

We use the feature encoding of the segmented query as
a classifier function. For retrieval, the cosine similarity is
measured by a dot product between the ℓ2-normalized query
and the retrieved images. We proceed by searching for those
groups of lexes that maximize this cosine similarity. For the
search, we devise a simple greedy algorithm. We first look
for those lexes most similar to the segmented query, as the
seeds. Then, we iteratively grow each of these seeds with
neighboring lexes that contribute to a larger cosine similarity,
until neighbors no longer contribute.
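
The greedy search described above can be sketched as follows. The sketch assumes sum pooling of lex encodings, a lex adjacency list, and growth by the single best neighbor per iteration. The number of seeds, the tie-breaking and all names are assumptions rather than details given in the text.

```python
import math


def cosine_to_query(region, encodings, query):
    """Cosine similarity between the query and the summed encoding of a region."""
    dim = len(query)
    pooled = [sum(encodings[l][d] for l in region) for d in range(dim)]
    norm = math.sqrt(sum(v * v for v in pooled))
    query_norm = math.sqrt(sum(v * v for v in query))
    if norm == 0 or query_norm == 0:
        return 0.0
    return sum(q * p for q, p in zip(query, pooled)) / (norm * query_norm)


def grow_region(seed, encodings, neighbors, query):
    """Add the best neighboring lex while it increases the cosine similarity."""
    region = {seed}
    best = cosine_to_query(region, encodings, query)
    while True:
        frontier = {n for l in region for n in neighbors[l]} - region
        if not frontier:
            break
        score, lex = max((cosine_to_query(region | {n}, encodings, query), n)
                         for n in frontier)
        if score <= best:
            break  # no neighbor contributes any more
        region.add(lex)
        best = score
    return region, best


def greedy_search(encodings, neighbors, query, num_seeds=3):
    """Seed with the lexes most similar to the query, grow each, rank the results."""
    ranked = sorted(range(len(encodings)),
                    key=lambda l: cosine_to_query({l}, encodings, query),
                    reverse=True)
    results = [grow_region(s, encodings, neighbors, query) for s in ranked[:num_seeds]]
    return sorted(results, key=lambda r: r[1], reverse=True)


# Toy image: four lexes in a chain 0-1-2-3 (illustrative values only).
toy_encodings = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [0.0, 0.0, 1.0]]
toy_neighbors = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
toy_query = [1.0, 1.0, 0.0]

best_region, best_score = greedy_search(toy_encodings, toy_neighbors, toy_query,
                                        num_seeds=1)[0]
```

In the toy example the region grows from lex 0 to lexes 0 and 1, and stops because adding lex 2 would lower the similarity to the query.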

We perform the segmented object retrieval using each
image in the dataset, with its given binary mask, as a query.
Our evaluation criterion for this experiment is the
intersection-over-union overlap between the segmented query and
our top guesses for the top two retrieved images. Similar
tQ9], we select the top 1 and 3 guesses per retrieved image.
When we select only the top one guess, the average
overlap is 0.24 and 25% of the images pass the PASCAL
criterion that requires at least 50% overlap [11]. When we
select the top three guesses the average overlap increases
to 0.27, while 30% of the retrieved images pass the
PASCAL criterion. We plot the top three guesses for
five random query images in Figure 4. Although no object
classifier is available at query time, codemaps find satisfactory
segments in the retrieved images using a single query
image only.
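
For reference, the overlap measure used in this evaluation can be written down in a few lines. The sketch assumes binary masks stored as nested lists of 0/1 values and follows the "at least 50%" wording of the text for the threshold. The function names and toy masks are hypothetical.

```python
def intersection_over_union(mask_a, mask_b):
    """Intersection-over-union of two equally sized binary masks."""
    intersection = union = 0
    for row_a, row_b in zip(mask_a, mask_b):
        for a, b in zip(row_a, row_b):
            intersection += 1 if (a and b) else 0
            union += 1 if (a or b) else 0
    return intersection / union if union else 0.0


def passes_pascal_criterion(mask_a, mask_b, threshold=0.5):
    """True when the overlap reaches the threshold (50% as stated in the text)."""
    return intersection_over_union(mask_a, mask_b) >= threshold


# Toy masks (illustrative values only).
ground_truth = [[1, 1, 0],
                [1, 1, 0]]
poor_guess = [[0, 1, 1],
              [0, 1, 1]]
good_guess = [[1, 1, 0],
              [1, 0, 0]]

poor_overlap = intersection_over_union(ground_truth, poor_guess)
good_overlap = intersection_over_union(ground_truth, good_guess)
```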

7. Conclusions

In this paper, we propose codemaps to segment, classify
and search objects locally. Codemaps reorder the encoding,
pooling and classification steps of object classification.
They do so by exploiting the homomorphic properties of
the sum operator and the grouping of local neighborhood scores
over lattice elements. Our first contribution is the introduction
of codemaps with ℓ2 normalization for arbitrarily shaped
image regions. Depending on the number of regions analyzed,
the normalized codemaps are up to 56x faster than
traditional Fisher vectors. The fast normalization enables us
to reach better than state-of-the-art performance in semantic
segmentation [5] by including Fisher codemaps. Our
second contribution is the embedding of nonlinearities in
the codemap decomposition by local kernel pooling. When
using the same lattice across images, it allows us to incorporate
proven effective explicit and approximate feature
mappindsT$21. This contribution brings us on par with the
state-of-the-art in object classification [13], but 3x faster. Finally,
we demonstrate that the efficiency gains of codemaps
facilitate object segment retrieval from a single query image.
Besides segmentation, classification and search, we anticipate
that other computer vision challenges may profit from
codemaps as well.